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Abstract — We consider a learning problem of identifying a dic- 
tionary matrix D £ R M x w from a sample set of M dimensional 
vectors Y € R MxP = N~ 1/2 DX e R MxP , where X G R JVxP 
is a sparse matrix in which the density of non-zero entries is 
< p < 1. In particular, we focus on the minimum sample size 
P c (sample complexity) necessary for perfectly identifying D of 
the optimal learning scheme when D and X are independently 
generated from certain distributions. By using the replica method 
of statistical mechanics, we show that P c ~ O(N) holds as long 
as a = M /N > p is satisfied in the limit of N —>■ oo. Our analysis 
also implies that the posterior distribution given Y is condensed 
only at the correct dictionary D when the compression rate 
\a is greater than a certain critical value an(p). This suggests 
that belief propagation may allow us to learn D with a low 
computational complexity using O(N) samples. 

I. Introduction 

The concept of sparse representations has recently attracted 
considerable attention from various fields in which the number 
of measurements is limited. Many real-world signals such 
as natural images are represented sparsely in Fourier/wavelet 
domains; in other words, many components vanish or are 
negligibly small in amplitude when the signals are represented 
by Fourier/wavelet bases. This empirical property is exploited 
in the signal recovery paradigm of compressed sensing (CS), 
thereby enabling the recovery of sparse signals from much 
fewer measurements than those estimated by the Nyquist- 
Shannon sampling theorem Q], (2), Q, (4). 

In signal processing techniques for exploiting sparsity, sig- 
nals are generally assumed to be described as linear combina- 
tions of a few dictionary atoms. Therefore, the effectiveness of 
this approach is highly dependent on the choice of dictionary, 
by which the objective signals appear sparse. A method for 
choosing an appropriate dictionary for sparse representation 
is dictionary learning (DL), whereby the dictionary is con- 
structed through a learning process from an available set of P 
training samples 0, E), 0, H). 

The ambiguity of the dictionary is fatal in signal/data 
analysis after learning. Therefore, an important issue is the 
estimation of the sample complexity, i.e., the sample size 
P c necessary for correct identification of the dictionary. In 
a seminal work, Aharon et al. showed that when the training 
set Y € R MxP is generated by a dictionary D G R MxN and 
a sparse matrix X G R JVxP (planted solution) as Y = DX, 
one can perfectly learn these if P > P c = (k + l)pfCk and k is 



sufficiently small, where k is the number of non-zero elements 
in each column of X |9). Unfortunately, this bound becomes 
exponentially large in N for k ~ O(N), which motivates us 
to improve the estimation. A recent rigorous study has shown 
that the mean squared error of recovered signals (per element) 
e after learning can scale as e ~ O (Nln(kP)/P), which can 
be read as P c ~ O(NlnN) (TO). However, the expression still 
leads to a natural question: is the logarithmic factor intrinsic 
or not? 

To answer this question, in this study, we evaluate the 
sample complexity of the optimal learning scheme defined 
for a given probabilistic model of dictionary learning. In a 
previous study, the authors assessed the sample complexity 
for a naive learning scheme: mirij^x 1 1^ ~ JSC 1 1 2 subj. 
to ||X|| < NPp (0 < p < lj, where ||X|| is the 
number of non-zero elements in X and D is enforced to 
be normalized appropriately. They used the replica method of 
statistical mechanics and found that P c ~ O(N) holds when 
a = M/N is greater than a certain critical value a na ive(/o) > P 
ifTTI . However, the smallest possible P c that can be obtained 
for a < a na ivo(p) has not been clarified thus far. In this study, 
we show that P c ~ O(N) holds in the entire region of a > p 
for the optimal learning scheme. 

II. Problem setup 

Let us suppose the following scenario of dictionary learning. 
Planted solutions, an M x N dictionary matrix D <S R MxN 



and an N x P sparse matrix X G 
generated from prior distributions 



pNxP 



M), 



are independently 



(1) 



P P (X) = Y[P P (X U ) = 1[{(1 - p)S(X a ) + pf(X a )}, (2) 

z,/ i,l 

respectively, where Md is a normalization constant and p G 
[0, 1] is the rate of non-zero elements in X. The distribution 
function f(X) does not have a finite mass probability at the 
origin. The set of training samples Y G M MxP , whose column 
vector corresponds to a training sample, is assumed to be given 
by the planted solutions as 

Y = -^M, (3) 



N 



where 1/v N is introduced for convenience in taking the large- 
system limit. A learner is required to infer D and X from Y. 
Our aim is to evaluate the minimum value of the sample size 
P required for perfectly identifying D and X. 

III. Bayesian optimal learning 

For mathematical formulation of our problem, let us denote 
the estimates of D and X yielded by an arbitrary learning 
scheme as D(Y) and X(Y). We evaluate the efficiency of 
the scheme using the mean squared errors (per element), 



MSE C (£>(•)): 



1 
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P P (P,X,Y)\\D-D(Y) 



MSE x (X(-)) = ^p Yl Pp(D,X,Y)\\X-X(Y)\\ 2 



(4) 



(5) 



y,d,x 



where A ■ B = ^ . AijBij represents the inner product 
between two matrices of the same dimension A and B, and 



I All = \/ A ■ A indicates the Frobenius norm of A. We 



impose the normalization constraint Y^^=i(^0^))% 
for each column index i = 1,2, ... ,N in order to avoid the 
ambiguity of the product D(Y)X(Y) = D(Y)A~ 1 AX{Y) 
for an arbitrary invertible diagonal matrix A 0. The joint 
distribution of D, X, and Y is given by 



M 



P P (D, X, Y) = S(Y - -=DX)P(D)P P (X 



(6) 



The perfect identification of D and X can be characterized 
by MSE D = MSE X = 0. The following theorem offers a 
useful basis for answering our question. 

Theorem 1. For an arbitrary learning scheme, (0 and (0 
are bounded from below as 



MSE D (D(-)) > 2 - 2 £ P P (Y) f 1 J2 ^ 

Y V t=l Vj 



MSE X (X(.)) > J2 P p( Y i 

Y 



NP 




where P p (Y) = ^2 DX P p (D,X,Y), and v/p 
average over D and X according to the posterior distri- 
bution of D and X under a given Y, P p (D, X\Y) = 
P p (D, X, Y) / P p {Y). The equalities hold when the estimates 
satisfy 



(D P\Y)) t 



,— ((D) )i 



'Additionally, D(Y)X(Y) is invariant under any simultaneous permu- 
tations of columns in D(Y) and rows in X(Y), which yields an JV! 
degeneracy of an intrinsically identical solution. However, this does not 
influence the results of the current analysis since the number of degeneracy 
N\ is negligible in the saddle point assessment of [Pg(Y)]y which scales 
exponentially in N 2 . 



where (A)i denotes the i-th column vector of matrix A. We 
refer to ((9]) as the Bayesian optimal learning scheme M2V . 

Proof: By applying the Cauchy-Shwartz inequality and the 
minimization of the quadratic function to MSEjj and MSEx, 
respectively, one can obtain Q-© after inserting the expres- 
sion 

£ xP p (D, X, Y) = P p (Y)J2xP p (D, X\Y) = P p (Y){x) p 
d.x n.x 

for x — D and X into ©. □ 

This theorem guarantees that when the setup of dictionary 
learning is characterized by (JT} — ([3j, the estimates of (|9) offer 
the best possible learning performance in the sense that (|4|l 
and ([5]) are minimized. As the perfect identification of D and 
X is characterized by MSEo = MSEx = 0, our purpose is 
fulfilled by analyzing the performance of the Bayesian optimal 
learning scheme of ([5). 

IV. Analysis 

For simplicity of calculation, let us set f(Xu) as the 
Gaussian distribution with mean and variance a\, and 
a x is set to unity for all numerical calculations later on. 
For generality, we consider cases in which the sparsity as- 
sumed by the learner, denoted as 6, can differ from the 
actual value p. When 9 ^ p, the estimates are given by 
[D(Y))i = v / M(( J D) e ) l /||((£>) e ),:|| and X(Y) = (X) e 
instead of ©. To evaluate MSE^ and MSE X , we need to 
evaluate macroscopic quantities 

lD = ^[(D) s -(D)e]Y, m D = ^[(D) s -(D) p ] Y (10) 



MN' 

Qx = -^-[(x-x) e ] y 



NP 1 

-4t;1(x) 



(X)o]y, m x 



MN 1 



1 



(11) 



— {(X) e -(X) p ] Yl (12) 



NP 

where [-] Y = £y -FpCp(-)- Note that GS)-C3 yield 
MSEd ~ 2 - 2m D /^/qrf\ and MSE X = pa 2 x + q x - 2m x . 

Unfortunately, evaluating these is intrinsically difficult be- 
cause it generally requires averaging the quantity 

ggS^g^ P e (Y,D\X 1 )P 6 (Y,D 2 ,X 2 )(D 1 ■ D 2 ) 

E D ^.^^ MY \D\X^)P 6 (Y \D\X*) 
(= (D)e- (D)e), (13) 

which includes summations over exponentially many terms in 
the denominator, with respect to Y . One promising approach 
for avoiding this difficulty involves multiplying Pq(Y) = 
(Y jXtD Pe(Y,D,X)) n (n = 2,3,... € N) inside the 
operation of [-]y for canceling the denominator of (Q~3), which 
makes the evaluation of a modified average 

1 [P?{Y)(D)e-(D) e ] Y 



Qd (n) 



MN 



[P?(Y)]y 



(14) 



2 Naive computation requires us to assess a column-wise overlap Co ; = 
M- 1 / 2 [({D) p )((D)g),./^/({D}g)r({D}g) l } Y for each column index'i = 
1, 2, . . . , N. However, the law of large numbers and the statistical uniformity 
allow the simplification of Co,i — > fnn/ ^/QD as M = aN tends to infinity. 
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Fig. 1. 7-dependence of (a) MSE D and (b) MSE X at a = 0.5, p = 0.2. 
Plots for 6 = p, O.Sp, and 1.5/a show the optimality of the correct parameter 
choice 9 = p. Broken curves represent locally unstable branches, which are 
thermodynamically irrelevant. 



feasible via the saddle point assessment of [Pq(Y)]y for 
N,M,P -> oo, keeping a = M/N and 7 = P/N as 0(1). 
Furthermore, the resulting expression is likely to hold for 
n G R as well. Therefore, we evaluate qjy using the formula 
qo = lim n ^o QD(n) with the expression, and similarly, for 
niD, Qx, qx, and rax- This procedure is often termed 
the replica method Q~3], lfT4l . Under the replica symmetric 
ansatz, which assumes that the dominant saddle point in the 
evaluation is invariant under any permutation of replica indices 
a = 1,2, ... ,n, the assessment is reduced to evaluating the 
extremum of the free entropy (density) function 



7 



QxQ 



x + qxqx 



m x mx + ((lnSx))) 



D + qDqD - 2m D m D - ln(Q D + <?d) + 



ai f qpqx-'2m D mx+ po\ . fn ,\ K 

-rn 7\ \-MQx-qDqx)>, (15) 

2 L Qx - qnqx > 



qp+m D 

QD + qD 



where ax = 14 

Sx=(l-6 
= (1- 



(Q: 



qnqx 
x + qx)o\, 



'ax 



: exp 



2a x 



I 



(16) 
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Fig. 2. 7-dependence of qo (left axis) and qx (right axis) for a ■■ 
p = 0.2. 



0.5 and 



and 



and z, which are 
Gaussian distribu- 



((•)) denotes the average over X 
distributed according to P P (X) and a 
tion with mean zero and variance 1, respectively. The 
extremized valu^l of <fi, 0*, is related to the average log- 
likelihood (density) of Y as N^Y^y Pp{Y)\nP e {Y) = 
lim n ^ (d/dn) {N~ 2 ln[P£(Y)] Y } = (f>* + constant. 

In Fig. [j] (a) MSE D and (b) MSE X for 6 = p = 0.2 
are plotted versus 7 together with those for 9 = 0.8p and 
9 = 1.5p. At 6 = p, MSE D and MSE X of thermody- 
namically relevant branches have minimum values in the 
entire 7 region, while a branch of solution characterized by 
MSE/j = MSEx = is shared by the three parameter sets. 
This supports the optimality of the correct parameter choice 
of 9 = p, and therefore, we hereafter focus our analysis on 
this case to estimate the minimum value of 7 for the perfect 
learning, MSE D = MSE X = 0. At 6 = p, the relationships 
mD = qD, mx = qx, and Qx = p hold from ([T0]i-(fT2li. and 
the extremum problem is reduced to 



qD 



Id 



1 



qD 



qx 




where qo and qx are given by 

aq D 



qx = 



P<4 



qoqx 



qD 



iqx 



P a x 



qoqx 



(17) 



(18) 



The other variables are provided as Qn = 1, Qx = 0, m x = 
qx, and m D = q D . 

V. Results 

A. Actual solutions 

Fig- El plots qD and qx versus 7 for a — 0.5 and p = 0.2. 
As shown in the figure, the solutions of qry and qx given 
by ( fTTI i are classified into three types: qr> = 1, qx = P G \, 

3 When multiple extrema exist, the maximum value among them should be 
chosen as long as no consistency condition is violated. 
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Fig. 3. 7m versus p for a = 0.5. 



Fig. 4. 7-dependence of q 

for 7 > 7s = a/ {a — p) 



for a = 0.5 and p ■■ 
= 1.666. . .. 



: 0.2. (fis diverges positively 



qD = qx = 0, and < q D < 1, < q x < po\. The 
first one yields MSEd = MSEx = 0, indicating the correct 
identification of D and X, and hence, we name it the success 
solution. The second one is referred to as the failure solution 
because it yields MSE_d = 2 and MSEjc = po\, which 
indicates complete failure of the learning of D and X. The 
third one yields finite MSE D and MSE X , < MSE D < 
2, < MSEjc < po\, and we term it the middle solution. 
1) Success solution: When the expression 

|2 



DX\ ( 1 \MP 

HY—=)= lim l-^=) 



cxpl 



VW DX \ 



'It 



(19) 



is used, the success solution of qo and qx behaves as (pa x — 
qx)/r = Xx and (1 - <?d)/t = \ D while q x and q D scale 
as q x — 6 x /t and qo = 9o/t. By substituting them into the 
equations of qo and q x , they are given by 

§ D = — , (20) 



PI 
9 



po\g 



P_ 

Xx 



XD 



where g = (a — p)j — a. xx and xd must be positive by 
definition, and hence, the success solution exists for 



7 > 



7s 



(21) 



a — p 

only when a > p. 

2) Failure solution: The failure solution qn = q x = 
appears at < 7 < 7f as a locally stable solution. When qo 
and q x are sufficiently small, they are expressed as 

iqx 



qx = po x aq D + 0(q ), q D = 



P°x 



0(q< 



(22) 



where 0(q 2 ) denotes the higher-order terms over second-order 
with respect to qo and q x . These expressions indicate that 
when 



7 > a = 7p, 



(23) 



the local stability of qo = qx = is lost. As shown in Fig. [2] 
the failure solution vanishes at 7f = 2.0 for a = 0.5. 



3) Middle solution: We define tm over which the middle 
solution with < qo < 1 and < q x < pa 2 x disappears, 
denoted as a vertical line in Fig. [2] which is provided as 
7m = 3.841 ... for the parameter choice of (a, p) — (0.5,0.2). 
The value of 7m depends on (a, p), as shown in Fig. [3] 
This figure indicates that 7m diverges at pm = 0.317. . . for 
a = 0.5. The relation between pm and a, denoted as pm{o) 
(or o<m(p))> generally accords with the critical condition that 
belief propagation (BP)-based signal recovery using the correct 
prior starts to be involved with multiple fixed points for the 
signal reconstruction problem of compressed sensing ITT61 in 
which the correct dictionary D is provided in advance. 

BP is also a potential algorithm for practically achieving the 
learning performance predicted by the current analysis because 
it is known that macroscopic behavior theoretically analyzed 
by the replica method can be confirmed experimentally for 
single instances by BP for many other systems |[T6l . ifTTl . 
|[T8l . The fact that only the success solution exists for 7 > 7m 
implies that one may be able to perfectly identify the correct 
dictionary D with a computational cost of polynomial order in 
N utilizing BP, without being trapped by other locally stable 
solutions, for a > ctyi{p)- 

B. Free entropy density 

There are three extrema of the free entropy (density), 0s, 
0p, and 0m, corresponding to the success solution, failure 
solution, and middle solution, respectively. Among them, the 
thermodynamically dominant solution that provides the correct 
evaluations of qo and q x is the one for which the value of 
free entropy is the largest. Fig. 2]plots (fis, <jfp, and 4>m versus 
7 for a = 0.5, p = 0.2, where 73 = 1.666 . . . and 7f = 2.0. 
In particular, functional forms of </>s and </>f are given by 



lim — 

7-^+0 2 



c?jln(-)-l j-arMooO- 



l-ln 



( P°x 



-7P( m 7 



in<4; 



-jH(p) 



h = + \ogpa 2 x ) + a}, 



(24) 
(25) 
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Fig. 5. Phase diagram on a — p plane. The dashed curve in the area of (I) 
is the result of II 11 . 

where r — > +0 originates from the expression of ([19} and 
H(p) = — (1 — /?)log(l — p) — plog(p). Further, (l24l shows 
that </>g diverges positively for g = (a — p)"f — a > 0, which 
guarantees that the success solution is always thermodynami- 
cally dominant for 7 > 73 ~ a/ (a — p) as qb of other solutions 
is kept finite. This leads to the conclusion that the sample 
complexity of the Bayesian optimal learning is P c = Njs, 
which is guaranteed as O(N) as long as a > p. This is the 
main consequence of the present study. 

Fig. [5] plots the phase diagram in the a — p plane. The 
union of the regions (I) and (II) represents the condition 
that the sample complexity P c is O(N), while the full curve 
of the upper boundary of (II) denotes qm(/)) above which 
BP is expected to work as an efficient learning algorithm. 
Dictionary learning is impossible in the region of (III). The 
critical condition a na i vc (p) above which the naive learning 
scheme of ifTTI can perfectly identify the planted solution by 
O(N) samples is drawn as the dashed curve for comparison. 
The considerable difference between a n aive(p) and p (or even 
c*m(p)) indicates the significance of using adequate knowledge 
of probabilistic models in dictionary learning. 

VI. Summary 

In summary, we assessed the minimum sample size re- 
quired for perfectly identifying a planted solution in dictionary 
learning (DL). For this assessment, we derived the optimal 
learning scheme defined for a given probabilistic model of DL 
following the framework of Bayesian inference. Unfortunately, 
actually evaluating the performance of the Bayesian optimal 
learning scheme involves an intrinsic technical difficulty. For 
resolving this difficulty, we resorted to the replica method of 
statistical mechanics, and we showed that the sample com- 
plexity can be reduced to O(N) as long as the compression 
rate a is greater than the density p of non-zero elements of the 
sparse matrix. This indicates that the performance of a naive 
learning scheme examined in a previous study IfTTI can be 
improved significantly by utilizing the knowledge of adequate 
probabilistic models in DL. It was also shown that when a is 



greater than a certain critical value ayiip), the macroscopic 
state corresponding to perfect identification of the planted 
solution becomes a unique candidate for the thermodynam- 
ically dominant state. This suggests that one may be able to 
learn the planted solution with a computational complexity 
of polynomial order in TV utilizing belief propagation for 
a > a M {p)- 

- Note added: After completing this study, the authors 
became aware that M9V presents results similar to those 
presented in this paper, where an algorithm for dictionary 
learning/calibration is independently developed on the basis 
of belief propagation. 
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